2/14/2020

Data science Venn Diagram.

reference, Photo by Tobias Fischer on Unsplash

The war over super-cooled water

Water can form high- (left) and low-density amorphous ices at liquid-nitrogen temperatures. Researchers want to determine whether water can also form two distinct liquid phases at low temperature. Credit: Osamu Mishima

reference
Photo by Samuel Scrimshaw on Unsplash

Growth in a time of debt

“When a country owes more than 90 percent of its GDP, it slides into recession.”

Photo by Annie Spratt on Unsplash

Debugging

Twitter fortes quote

Elements of robust code

  • Will not break easily under changes
  • Can be refactored simply
  • Can be extended without breaking other parts of the program
  • Can be tested

Testing paradigms

Example packages

  • Examples of packages from the tidyverse
  • testthat
    • Provides functions to describe what you expect a piece of code to do
    • Integrates easily with different projects
  • assertr
    • Suite of functions for checking assumptions about data
    • Can be used to spot mistakes quickly and early
Photo by NASA on Unsplash

Motivating example

Imagine we have the following data:

date        lab_test  temp  infections
2020-02-01         7   3.0          17
2020-02-02         7   6.9          14
2020-02-03         7  14.3          14
2020-02-04         4  11.5          14
2020-02-05         4   2.9          12
  • What columns do we expect?
  • Should the days be consecutive? Should they be unique?
  • What range should each column fall in? Are any values implausible?
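As a rough illustration, the three questions above can be answered with base R alone. The sketch below rebuilds only the five rows shown on this slide; the names example_data and expected_cols are made up for the example, and the non-negative bound is an assumption about what counts as plausible.

```r
# Only the five rows shown above; the full data set is not reproduced here.
example_data <- data.frame(
  date       = c("2020-02-01", "2020-02-02", "2020-02-03",
                 "2020-02-04", "2020-02-05"),
  lab_test   = c(7, 7, 7, 4, 4),
  temp       = c(3.0, 6.9, 14.3, 11.5, 2.9),
  infections = c(17, 14, 14, 14, 12),
  stringsAsFactors = FALSE
)

# What columns do we expect?
expected_cols <- c("date", "lab_test", "temp", "infections")
has_expected_cols <- all(expected_cols %in% names(example_data))

# Are the days unique and consecutive?
dates <- as.Date(example_data$date)
dates_unique <- anyDuplicated(dates) == 0
dates_consecutive <- all(diff(dates) == 1)

# Are the numeric values in a plausible (here: non-negative) range?
numeric_cols <- c("lab_test", "temp", "infections")
values_non_negative <- all(example_data[numeric_cols] >= 0)
```

Checks like these work, but they are scattered and easy to forget, which is the motivation for collecting them in one processing function.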

Creating a data processing function

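library(dplyr)    # provides the %>% pipe
library(assertr)  # provides verify() and assert()

process_data <- function(data){
  data %>% 
    # verify all columns are present (placeholder: returns input unchanged)
    (function(x){x}) %>%
    # verify numbers are non-negative (placeholder)
    (function(x){x}) %>%
    # verify all dates are unique (placeholder)
    (function(x){x})
}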

Creating a data processing function

process_data <- function(data){
  data %>% 
    # verify all columns are present
    verify(has_all_names("date","lab_test","temp","infections"),
           error_fun=just_warn) %>%
    # verify numbers are non-negative
    (function(x){x}) %>%
    # verify all dates are unique
    (function(x){x})
}

Creating a data processing function

process_data <- function(data){
  data %>% 
    # verify all columns are present
    verify(has_all_names("date","lab_test","temp","infections"),
           error_fun=just_warn) %>%
    # verify numbers are non-negative
    assert(within_bounds(0,Inf),c("infections","temp","lab_test"),
           error_fun=just_warn) %>%
    # verify all dates are unique
    (function(x){x})
}

Creating a data processing function

process_data <- function(data){
  data %>% 
    # verify all columns are present
    verify(has_all_names("date","lab_test","temp","infections"),
           error_fun=just_warn) %>%
    # verify numbers are non-negative
    assert(within_bounds(0,Inf),c("infections","temp","lab_test"),
           error_fun=just_warn) %>%
    # verify all dates are unique
    assert(is_uniq,date,error_fun=just_warn)
}
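The function above passes error_fun=just_warn, so a failed check prints the violations and raises a warning while the pipeline keeps running. If that argument is omitted, assertr falls back to its default behaviour of stopping with an error. A minimal sketch of the difference, using a made-up two-row data frame called toy:

```r
library(assertr)
library(magrittr)  # for %>%

# Hypothetical toy data with a duplicated date
toy <- data.frame(
  date       = c("2020-02-01", "2020-02-01"),
  infections = c(17, 14)
)

# With no error_fun supplied, a failed assertion stops execution,
# so we catch the error here to keep the script running.
result <- tryCatch(
  toy %>% assert(is_uniq, date),
  error = function(e) "stopped"
)
```

Warning instead of stopping is a design choice: it suits exploratory work where you want to see every problem at once, whereas stopping is safer in an automated pipeline.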

How do we know our function works?
Photo by Ana-Maria Berbec on Unsplash

Test that the function returns a data frame

process_data(data) %>%
  head(n=5)
## # A tibble: 5 x 4
##   date       lab_test  temp infections
##   <chr>         <dbl> <dbl>      <dbl>
## 1 2020-02-01        7   3           17
## 2 2020-02-02        7   6.9         14
## 3 2020-02-03        7  14.3         14
## 4 2020-02-04        4  11.5         14
## 5 2020-02-05        4   2.9         12

Test function doesn’t allow non-unique dates

cleaned_data <- data %>% 
  add_row(date="2020-02-01",lab_test=1,
          temp=20.,infections=17) %>% 
  process_data()
## Column 'date' violates assertion 'is_uniq' 2 times
##     verb redux_fn predicate column index      value
## 1 assert       NA   is_uniq   date     1 2020-02-01
## 2 assert       NA   is_uniq   date    21 2020-02-01
## Warning: assertr encountered errors

Test function doesn’t allow negative numbers

cleaned_data <- data %>% 
  add_row(date="2020-02-21",lab_test=-1,
          temp=20.,infections=17) %>% 
  process_data()
## Column 'lab_test' violates assertion 'within_bounds(0, Inf)' 1 time
##     verb redux_fn             predicate   column index value
## 1 assert       NA within_bounds(0, Inf) lab_test    21    -1
## Warning: assertr encountered errors

Writing tests

  • Create a tests/ folder
  • Within the folder, create a script named test_process_data.R
  • At the top of that script, source the file that defines process_data(), then write the tests below it

test_that("process data", {
  expect_type(process_data(data), "list")
})
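Note that expect_type(x, "list") passes for any list, because a data frame is stored as a list internally. If the intent is to check for a data frame specifically, testthat's expect_s3_class() is one option. A small sketch on a made-up data frame df rather than the real process_data() output:

```r
library(testthat)

# Hypothetical stand-in for the value returned by process_data()
df <- data.frame(x = 1:3)

test_that("result is a data frame", {
  expect_type(df, "list")            # true for any list
  expect_s3_class(df, "data.frame")  # true only for data frames
})
```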

Writing tests

test_that("date not unique", {
  expect_warning(
    data %>% 
    add_row(date="2020-02-20",
            lab_test=1,
            temp=20.,
            infections=17) %>% 
    process_data() 
    )
    
})
## Column 'date' violates assertion 'is_uniq' 2 times
##     verb redux_fn predicate column index      value
## 1 assert       NA   is_uniq   date    20 2020-02-20
## 2 assert       NA   is_uniq   date    21 2020-02-20

Writing tests

test_that("Negative numbers", {
  expect_warning(
    data %>% 
      add_row(date="2020-02-21",
              lab_test=-1,
              temp=20.,
              infections=17) %>% 
      process_data() 
  )
})
## Column 'lab_test' violates assertion 'within_bounds(0, Inf)' 1 time
##     verb redux_fn             predicate   column index value
## 1 assert       NA within_bounds(0, Inf) lab_test    21    -1
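A bare expect_warning() passes on any warning at all. One way to tighten the test is to also match the warning text, which the output earlier shows to be "assertr encountered errors". The sketch below assumes that message and uses a made-up data frame toy so that it runs on its own:

```r
library(assertr)
library(magrittr)  # for %>%
library(testthat)

# Hypothetical toy data containing one negative value
toy <- data.frame(lab_test = c(7, -1))

test_that("negative values raise the assertr warning", {
  expect_warning(
    toy %>%
      assert(within_bounds(0, Inf), lab_test, error_fun = just_warn),
    "assertr encountered errors"
  )
})
```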

Running tests

testthat::test_dir("example_code/tests")
## v |  OK F W S | Context
## 
## / |   0       | test_process_data
## Column 'date' violates assertion 'is_uniq' 2 times
##     verb redux_fn predicate column index      value
## 1 assert       NA   is_uniq   date    20 2020-02-20
## 2 assert       NA   is_uniq   date    21 2020-02-20
## 
## Column 'lab_test' violates assertion 'within_bounds(0, Inf)' 1 time
##     verb redux_fn             predicate   column index value
## 1 assert       NA within_bounds(0, Inf) lab_test    21    -1
## 
## 
## v |   3       | test_process_data
## 
## == Results =======================================================================================
## OK:       3
## Failed:   0
## Warnings: 0
## Skipped:  0
## 
## You rock!
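To try test_dir() without the example project, the sketch below writes a throwaway test file into a temporary folder and runs it. The folder, the file name test_example.R and the trivial test inside it are all invented for the illustration.

```r
library(testthat)

# Create an empty temporary folder to act as the tests/ directory
tests_dir <- tempfile("tests_")
dir.create(tests_dir)

# Write a minimal test file; test_dir() looks for files named test*.R
writeLines(
  c('test_that("addition works", {',
    '  expect_equal(1 + 1, 2)',
    '})'),
  file.path(tests_dir, "test_example.R")
)

# Run every test file in the folder and keep the results object
results <- test_dir(tests_dir)
```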

Wrapping up

expect(umbrellaOpens).toBe(true)

tests: 1 passed, 1 total